CONTENT-BASED VIDEO COMPRESSION 



BACKGROUND OF THE INVENTION 

The invention relates to electronic video methods and devices, and, more 
particularly, to digital communication and storage systems with compressed 
video. 

Video communication (television, teleconferencing, and so forth) typically
transmits a stream of video frames (images) along with audio over a
transmission channel for real time viewing and listening by a receiver.
However, transmission channels frequently add corrupting noise and have
limited bandwidth (e.g., television channels limited to 6 MHz). Consequently,
digital video transmission with compression enjoys widespread use. In
particular, various standards for compression of digital video have emerged and
include H.261, MPEG-1, and MPEG-2, with more to follow, including in
development H.263 and MPEG-4. There are similar audio compression
methods such as CELP and MELP.

Tekalp, Digital Video Processing (Prentice Hall 1995), Clarke, Digital
Compression of Still Images and Video (Academic Press 1995), and Schafer et
al, Digital Video Coding Standards and Their Role in Video Communications,
83 Proc. IEEE 907 (1995), include summaries of various compression methods,
including descriptions of the H.261, MPEG-1, and MPEG-2 standards plus the
H.263 recommendations and indications of the desired functionalities of MPEG-4.
These references and all other references cited are hereby incorporated by
reference.

H.261 compression uses interframe prediction to reduce temporal
redundancy and discrete cosine transform (DCT) on a block level together with
high spatial frequency cutoff to reduce spatial redundancy. H.261 is
recommended for use with transmission rates in multiples of 64 Kbps (kilobits
per second) to 2 Mbps (megabits per second).

The H.263 recommendation is analogous to H.261 but for bitrates of
about 22 Kbps (twisted pair telephone wire compatible) and with motion
estimation at half-pixel accuracy (which eliminates the need for loop filtering
available in H.261) and overlapped motion compensation to obtain a denser
motion field (set of motion vectors) at the expense of more computation and
adaptive switching between motion compensation with 16 by 16 macroblocks and
8 by 8 blocks.

MPEG-1 and MPEG-2 also use temporal prediction followed by two
dimensional DCT transformation on a block level as H.261, but they make
further use of various combinations of motion-compensated prediction,
interpolation, and intraframe coding. MPEG-1 aims at video CDs and works
well at rates about 1-1.5 Mbps for frames of about 360 pixels by 240 lines and
24-30 frames per second. MPEG-1 defines I, P, and B frames with I frames
intraframe, P frames coded using motion-compensation prediction from previous
I or P frames, and B frames using motion-compensated bi-directional
prediction/interpolation from adjacent I and P frames.

MPEG-2 aims at digital television (720 pixels by 480 lines) and uses
bitrates up to about 10 Mbps with MPEG-1 type motion compensation with I,
P, and B frames plus adds scalability (a lower bitrate may be extracted to
transmit a lower resolution image).

However, the foregoing MPEG compression methods result in a number
of unacceptable artifacts such as blockiness and unnatural object motion when
operated at very-low-bit-rates. Because these techniques use only the
statistical dependencies in the signal at a block level and do not consider the
semantic content of the video stream, artifacts are introduced at the block
boundaries under very-low-bit-rates (high quantization factors). Usually these
block boundaries do not correspond to physical boundaries of the moving objects
and hence visually annoying artifacts result. Unnatural motion arises when
the limited bandwidth forces the frame rate to fall below that required for
smooth motion.

MPEG-4 is to apply to transmission bitrates of 10 Kbps to 1 Mbps and
is to use a content-based coding approach with functionalities such as
scalability, content-based manipulations, robustness in error prone
environments, multimedia data access tools, improved coding efficiency, ability
to encode both graphics and video, and improved random access. A video
coding scheme is considered content scalable if the number and/or quality of
simultaneous objects coded can be varied. Object scalability refers to
controlling the number of simultaneous objects coded and quality scalability
refers to controlling the spatial and/or temporal resolutions of the coded objects.
Scalability is an important feature for video coding methods operating across
transmission channels of limited bandwidth and also channels where the
bandwidth is dynamic. For example, a content-scalable video coder has the
ability to optimize the performance in the face of limited bandwidth by encoding
and transmitting only the important objects in the scene at a high quality. It
can then choose to either drop the remaining objects or code them at a much
lower quality. When the bandwidth of the channel increases, the coder can
then transmit additional bits to improve the quality of the poorly coded objects
or restore the missing objects.

Musmann et al, Object-Oriented Analysis-Synthesis Coding of Moving
Images, 1 Sig. Proc.: Image Comm. 117 (1989), illustrates hierarchical moving
object detection using source models. Tekalp, chapters 23-24 also discusses
object-based coding.

Medioni et al, Corner Detection and Curvature Representation Using
Cubic B-Splines, 39 Comp.Vis.Grph.Image Processing, 267 (1987), shows
encoding of curves with B-Splines. Similarly, Foley et al, Computer Graphics
(Addison-Wesley 2d Ed.), pages 491-495 and 504-507, discusses cubic B-splines
and Catmull-Rom splines (which are constrained to pass through the control
points).

In order to achieve efficient transmission of video, a system must utilize
compression schemes that are bandwidth efficient. The compressed video data
is then transmitted over communication channels which are prone to errors.
For video coding schemes which exploit temporal correlation in the video data,
channel errors result in the decoder losing synchronization with the encoder.
Unless suitably dealt with, this can result in noticeable degradation of the
picture quality. To maintain satisfactory video quality or quality of service, it
is desirable to use schemes to protect the data from these channel errors.
However, error protection schemes come with the price of an increased bitrate.
Moreover, it is not possible to correct all possible errors using a given error-
control code. Hence, it becomes necessary to resort to some other techniques
in addition to error control to effectively remove annoying and visually
disturbing artifacts introduced by these channel induced errors.

In fact, a typical channel, such as a wireless channel, over which
compressed video is transmitted is characterized by high random bit error rates
(BER) and multiple burst errors. The random bit errors occur with a
probability of around 0.001 and the burst errors have a duration that usually
lasts up to 24 milliseconds (msec).

Error correcting codes such as the Reed-Solomon (RS) codes correct
random errors up to a designed number per block of code symbols. Problems
arise when codes are used over channels prone to burst errors because the
errors tend to be clustered in a small number of received symbols. The
commercial digital music compact disc (CD) uses interleaved codewords so that
channel bursts may be spread out over multiple codewords upon decoding. In
particular, the CD error control encoder uses two shortened RS codes with 8-bit
symbols from the code alphabet GF(256). Thus 16-bit sound samples each take
two information symbols. First, the samples are encoded twelve at a time (thus
24 symbols) by a (28,24) RS code, then the 28-symbol codewords pass a
28-branch interleaver with delay increments of 28? symbols between branches.
Thus 28 successive 28-symbol codewords are interleaved symbol by symbol.
After the interleaving, the 28-symbol blocks are encoded with a (32,28) RS
coder to output 32-symbol codewords for transmission. The decoder is a mirror
image: a (32,28) RS decoder, 28-branch deinterleaver with delay increment 4
symbols, and a (28,24) RS decoder. The (32,28) RS decoder can correct 1 error
in an input 32-symbol codeword and can output 28 erased symbols for two or
more errors in the 32-symbol input codeword. The deinterleaver then spreads
these erased symbols over 28 codewords. The (28,24) RS decoder is set to
detect up to and including 4 symbol errors which are then replaced with erased
symbols in the 24-symbol output words; for 5 or more errors, all 24 symbols are
erased. This corresponds to erased music samples. The decoder may
interpolate the erased music samples with adjacent samples. Generally, see
Wicker, Error Control Systems for Digital Communication and Storage
(Prentice Hall 1995).
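
The burst-spreading effect of such an interleaver can be illustrated with a
minimal sketch. It models a delay-line interleaver at codeword granularity
only; the function names, the zero fill symbol, and the use of None to mark an
erased symbol are assumptions made for the example, and no Reed-Solomon coding
is performed.

```python
def interleave(codewords, delay):
    """Place symbol i of codeword t into transmitted word t + i * delay.

    codewords: list of equal-length lists of symbols.
    Slots that receive no symbol keep the assumed fill value 0.
    """
    n = len(codewords[0])
    total = len(codewords) + (n - 1) * delay
    out = [[0] * n for _ in range(total)]
    for t, word in enumerate(codewords):
        for i, sym in enumerate(word):
            out[t + i * delay][i] = sym
    return out


def deinterleave(stream, delay):
    """Read symbol i of codeword t back from transmitted word t + i * delay."""
    n = len(stream[0])
    count = len(stream) - (n - 1) * delay
    return [[stream[t + i * delay][i] for i in range(n)] for t in range(count)]


# Six 4-symbol codewords, delay increment 2 (small values chosen for display).
words = [[10 * t + i for i in range(4)] for t in range(6)]
stream = interleave(words, 2)
stream[6] = [None] * 4            # a burst erases one whole transmitted word
recovered = deinterleave(stream, 2)
for word in recovered:
    print(word)
```

After deinterleaving, the erased transmitted word shows up as at most one
erased symbol in any recovered codeword, which is the situation a
symbol-correcting code handles well.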

There are several hardware and software implementations of the H.261,
MPEG-1, and MPEG-2 compression and decompression. The hardware can be
single or multichip integrated circuit implementations (see Tekalp
pages 455-456) or general purpose processors such as the Ultrasparc or
TMS320C80 running appropriate software. Public domain software is available
from the Portable Video Research Group at Stanford University.



SUMMARY OF THE INVENTION 

The present invention provides content-based video compression with
difference region encoding instead of strictly moving object encoding, blockwise
contour encoding, motion compensation failure encoding connected to the
blockwise contour tiling, subband including wavelet encoding restricted to
subregions of a frame, scalability by uncovered background associated with
objects, and error robustness through embedded synchronization in each moving
object's code plus coder feedback to a deinterleaver. It also provides video
systems with applications for this compression, such as video telephony and
fixed camera surveillance for security, including time-lapse surveillance, with
digital storage in random access memories.

Advantages include efficient low bitrate video encoding with object
scalability and error robustness with very-low-bit-rate video compression which
allows convenient transmission and storage. This permits low bitrate
teleconferencing and also surveillance information storage by random access
hard disk drive rather than serial access magnetic tape. And the segmentation
of moving objects permits concentration on any one or more of the moving
objects (MPEG-4).



BRIEF DESCRIPTION OF THE DRAWINGS

The drawings are schematic for clarity. 

Figure 1 shows a preferred embodiment telephony system. 

Figure 2 illustrates a preferred embodiment surveillance system.

Figure 3 is a flow diagram for a preferred embodiment video
compression.

Figures 4a-d show motion segmentation.

Figures 5a-g illustrate boundary contour encoding.

Figure 6 shows motion compensation.

Figure 7 illustrates motion failure regions.

Figure 8 shows the control grid on the motion failure regions.

Figures 9a-b show a single wavelet filtering stage.

Figures 10a-c illustrate wavelet decomposition.

Figure 11 illustrates a zerotree for wavelet coefficient quantization.

Figure 12 is a wavelet compressor block diagram.

Figures 13a-v show scalability steps.

Figures 14a-b are a scene with and without a particular object.

Figures 15a-b show an error correcting coder and decoder.

Figures 16a-b illustrate decoder feedback.



DESCRIPTION OF THE PREFERRED EMBODIMENTS

Overview of Compression and Decompression

Figure 1 illustrates in block diagram a preferred embodiment
video-telephony (teleconferencing) system which transmits both speech and an
image of the speaker using preferred embodiment compression, encoding, decoding,
and decompression including error correction with the encoding and decoding.
Of course, Figure 1 shows only transmission in one direction and to only one
receiver; in practice a second camera and second receiver would be used for
transmission in the opposite direction and a third or more receivers and
transmitters could be connected into the system. The video and speech are
separately compressed and the allocation of transmission channel bandwidth
between video and speech may be dynamically adjusted depending upon the
situation. The costs of telephone network bandwidth demand a low-bit-rate
transmission. Indeed, very-low-bit-rate video compression finds use in
multimedia applications where visual quality may be compromised.

Figure 2 shows a first preferred embodiment surveillance system,
generally denoted by reference numeral 200, as comprising one or more fixed
video cameras 202 focussed on stationary background 204 (with occasional
moving objects 206 passing in the field of view) plus video compressor 208
together with remote storage 210 plus decoder and display 220.
Compressor 208 provides compression of the stream of video images of the
scene (for example, 30 frames a second with each frame 176 by 144 8-bit
monochrome pixels) so that the data transmission rate from compressor 208 to
storage 210 may be very low, for example 22 Kbits per second, while retaining
high quality images. System 200 relies on the stationary background and only
encodes moving objects (which appear as regions in the frames which move
relative to the background) with predictive motion to achieve the low data rate.
This low data rate enables simple transmission channels from cameras to
monitors and random access memory storage such as magnetic hard disk drives
available for personal computers. Indeed, a single telephone line with a modem
may transmit the compressed video image stream to a remote monitor.
Further, storage of the video image stream for a time interval, such as a day
or week as required by the particular surveillance situation, will require much
less memory after such compression.

Video camera 202 may be a CCD camera with an in-camera analog-to-digital
converter so that the output to compressor 208 is a sequence of digital
frames as generally illustrated in Figure 2; alternatively, analog cameras with
additional hardware may be used to generate the digital video stream of
frames. Compressor 208 may be hardwired or, more conveniently, a digital
signal processor (DSP) with the compression steps stored in onboard memory,
RAM or ROM or both. For example, a TMS320C50 or TMS320C80 type DSP
may suffice. Also, for a teleconferencing system as shown in Figure 1, error
correction with real time reception may be included and implemented on
general purpose processors.

Figure 3 shows a high level flow diagram for the preferred embodiment
video compression methods which include the following steps for an input
consisting of a sequence of frames, F_0, F_1, F_2, ..., with each frame 144 rows of
176 pixels or 288 rows of 352 pixels and with a frame rate of 10 frames per
second. Details of the steps appear in the following sections.

Frames of these two sizes partition into arrays of 9 rows of 11
macroblocks with each macroblock being 16 pixels by 16 pixels or 18 rows of 22
macroblocks. The frames will be encoded as I pictures or P pictures; B pictures
with their backward interpolation would create overly large time delays for very
low bitrate transmission. An I picture occurs only once every 5 or 10 seconds,
and the majority of frames are P pictures. For the 144 rows of 176 pixels size
frames, roughly an I picture will be encoded with 20 Kbits and a P picture with
2 Kbits, so the overall bitrate will be roughly 22 Kbps (only 10 frames per
second or less). The frames may be monochrome or color with the color given
by an intensity frame (Y signal) plus one quarter resolution (subsampled) color
combination frames (U and V signals).
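
The macroblock counts and the overall bitrate quoted above follow from simple
arithmetic, which the following sketch reproduces. The function names are
illustrative, and the sketch assumes that 1 Kbit means 1000 bits and that one
I picture takes the place of one P picture in each I picture period.

```python
def macroblock_grid(rows, cols, size=16):
    """Return (rows of macroblocks, macroblocks per row) for a frame."""
    return rows // size, cols // size


def average_bitrate(i_bits, p_bits, frames_per_second, i_period_seconds):
    """Average bits per second with one I picture per i_period_seconds
    and P pictures for all other frames."""
    frames = frames_per_second * i_period_seconds
    total_bits = i_bits + (frames - 1) * p_bits
    return total_bits / i_period_seconds


print(macroblock_grid(144, 176))                # 144 rows of 176 pixels
print(macroblock_grid(288, 352))                # 288 rows of 352 pixels
print(average_bitrate(20000, 2000, 10, 10))     # I picture every 10 seconds
print(average_bitrate(20000, 2000, 10, 5))      # I picture every 5 seconds
```

With an I picture every 10 seconds the average comes to 21.8 Kbps, and with an
I picture every 5 seconds to 23.6 Kbps, consistent with the rough figure of
22 Kbps.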
(1) Initially, encode the zeroth frame F_0 as an I picture like in MPEG-1
using a waveform coding technique based on the DCT or wavelet transform.
For the DCT case, partition the frame into 8 by 8 blocks; compute the DCT of
each block; cut off the high spatial frequencies; quantize and encode the
remaining frequencies, and transmit. The encoding includes run length
encoding, then Huffman encoding, and then error correction encoding. For the
wavelet case, compute the multi-level decomposition of the frame; quantize and
encode the resulting wavelet coefficients, and transmit. Other frames will also
be encoded as I pictures with the frequency dependent upon the transmission
channel bitrate. And for F_N to be an I picture, encode in the same manner.

(2) For frame F_N to be a P picture, detect moving objects in the frame by
finding the regions of change from reconstructed F_N-1 to F_N. Reconstructed
F_N-1 is the approximation to F_N-1 which is actually transmitted as described
below. Note that the regions of change need not be partitioned into moving
objects plus uncovered background and will only approximately describe the
moving objects. However, this approximation suffices and provides more
efficient low bitrate coding. Of course, an alternative would be to also make this
partition into moving objects plus uncovered background through mechanisms
such as inverse motion vectors to determine if a region maps to outside of the
change region in the previous frame and thus is uncovered background, edge
detection to determine the object, or presumption of object characteristics
(models) to distinguish the object from background.

(3) For each connected component of the regions of change from step (2),
code its boundary contour, including any interior holes. Thus the boundaries
of moving objects are not exactly coded; rather, the boundaries of entire regions
of change are coded and approximate the boundaries of the moving objects. The
boundary coding may be either by splines approximating the boundary or by a
binary mask indicating blocks within the region of change. The spline provides
more accurate representation of the boundary, but the binary mask uses a
smaller number of bits. Note that the connected components of the regions of
change may be determined by a raster scanning of the binary image mask and
sorting pixels in the mask into groups, which may merge, according to the
sorting of adjacent pixels. The final groups of pixels are the connected
components (connected regions). For an example of a program, see Ballard et al,
Computer Vision (Prentice Hall) at pages 149-152. For convenience in the
following the connected components (connected regions) may be referred to as
(moving) objects.

(4) Remove temporal redundancies in the video sequence by motion
estimation of the objects from the previous frame. In particular, match a 16 by
16 block in an object in the current frame F_N with the 16 by 16 block in the
same location in the preceding reconstructed frame F_N-1 plus translations of
this block up to 15 pixels in all directions. The best match defines the motion
vector for this block, and an approximation F'_N to the current frame F_N can be
synthesized from the preceding frame F_N-1 by using the motion vectors with
their corresponding blocks of the preceding frame.

(5) After the use of motion of objects to synthesize an approximation F'_N,
there may still be areas within the frame which contain a significant amount
of residual information, such as for fast changing areas. That is, the regions
of difference between F_N and the synthesized approximation have motion
segmentation applied analogous to the steps (2)-(3) to define the motion failure
regions which contain significant information.

(6) Encode the motion failure regions from step (5) using a waveform
coding technique based on the DCT or wavelet transform. For the DCT case,
tile the regions with 16 by 16 macroblocks, apply the DCT on 8 by 8 blocks of
the macroblocks, quantize and encode (runlength and then Huffman coding).
For the wavelet case, set all pixel values outside the regions to zero, apply the
multi-level decomposition, quantize and encode (zerotree and then arithmetic
coding) only those wavelet coefficients corresponding to the selected regions.

(7) Assemble the encoded information for I pictures (DCT or wavelet
data) and P pictures (objects ordered with each object having contour, motion
vectors, and motion failure data). These can be codewords from a table of
Huffman codes; this is not a dynamic table but rather generated
experimentally.

(8) Insert resynchronization words at the beginning of each I picture
data, each P picture, each contour data, each motion vector data, and each
motion failure data. These resynchronization words are unique in that they do
not appear in the Huffman codeword table and thus can be unambiguously
determined.

(9) Encode the resulting bitstream from step (8) with Reed-Solomon codes
together with interleaving. Then transmit or store.

(10) Decode a received encoded bitstream by Reed-Solomon plus
deinterleaving. The resynchronization words help after decoding failure and
also provide access points for random access. Further, the decoding may be
with shortened Reed-Solomon decoders on either side of the deinterleaver plus
feedback from the second decoder to the first decoder (a stored copy of the
decoder input) for enhanced error correction.

(11) Additional functionalities such as object scalability (selective
encoding/decoding of objects in the sequence) and quality scalability (selective
enhancement of the quality of the objects) which result in a scalable bitstream
are also supported.
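
As an illustration of the DCT case of step (1), the following minimal sketch
transforms one 8 by 8 block, cuts off the high spatial frequencies, and
quantizes what remains. The cutoff rule (drop coefficients whose indices
satisfy u + v >= cutoff), the uniform quantizer step, and all function names
are assumptions made for the example; the run length, Huffman, and error
correction encoding are omitted.

```python
import math

N = 8  # block size


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix: row k is the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
                     for x in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


C = dct_matrix()
CT = [list(col) for col in zip(*C)]   # transpose of C


def dct2(block):
    """Two-dimensional DCT of an N by N block: C * block * C^T."""
    return matmul(matmul(C, block), CT)


def idct2(coeffs):
    """Inverse two-dimensional DCT: C^T * coeffs * C."""
    return matmul(matmul(CT, coeffs), C)


def code_block(block, cutoff=4, step=16):
    """Zero coefficients with u + v >= cutoff (assumed rule), quantize the rest."""
    coeffs = dct2(block)
    return [[0 if u + v >= cutoff else round(coeffs[u][v] / step)
             for v in range(N)] for u in range(N)]


def decode_block(levels, step=16):
    """Rescale the quantized levels and invert the transform."""
    return idct2([[q * step for q in row] for row in levels])


flat = [[128] * N for _ in range(N)]   # a uniform gray block
levels = code_block(flat)
print(levels[0][0], decode_block(levels)[0][0])
```

A uniform block keeps only its DC coefficient, so a single quantized level
suffices to reconstruct it.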

Moving Object Detection and Segmentation

The first preferred embodiment method detects and segments moving
objects by use of regions of difference between successive video frames but does
not attempt to segregate such regions into moving objects plus uncovered
background. This simplifies the information but appears to provide sufficient
quality. In particular, for frame F_N at each pixel find the absolute value of the
difference in the intensity (Y signal) between F_N and reconstructed F_N-1. For
8-bit intensities (256 levels labelled 0 to 255), the camera calibration variability
would suggest taking the intensity range of 0 to 15 to be dark and the
range 240-255 to be saturated brightness. The absolute value of the intensity
difference at a pixel will lie in the range from 0 to 255, so eliminate minimal
differences and form a binary image of differences by thresholding (set any
pixel absolute difference of less than or equal to 5 or 10 (depending upon the
scene ambient illumination) to 0 and any pixel absolute difference greater than
30 to 1). This yields a binary image which may appear speckled: Figures 4a-b
illustrate two successive frames and Figure 4c the binary image of thresholded
absolute difference with black pixels indicating 1s (significant differences)
and the white background pixels indicating 0s.
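
The thresholding of the absolute intensity difference can be sketched as
follows. The sketch assumes frames held as lists of rows of 8-bit intensities
and, as a simplification, uses a single threshold for both the 0 and the 1
decision; the function name and the default threshold value are illustrative.

```python
def difference_mask(current, previous, threshold=10):
    """Binary image: 1 where |current - previous| exceeds threshold, else 0."""
    return [[1 if abs(c - p) > threshold else 0
             for c, p in zip(cur_row, prev_row)]
            for cur_row, prev_row in zip(current, previous)]


previous = [[100, 100, 100, 100],
            [100, 100, 100, 100]]
current = [[102, 100, 180, 100],
           [100, 95, 175, 100]]
for row in difference_mask(current, previous):
    print(row)
```

Only the two pixels whose intensity changed by far more than the threshold
are marked; the small fluctuations are suppressed.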

Then eliminate small isolated areas in the binary image, such as would
result from noise, by median filtering (replace a 1 at a pixel with a 0 if the 4
(8?) nearest neighbor pixels are all 0s).
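
A minimal sketch of this isolated-pixel removal, using the 4 nearest
neighbors, is given below. Treating neighbors that fall outside the frame as
0s is an assumption of the sketch, as is the function name.

```python
def remove_isolated(mask):
    """Clear every 1 pixel whose four nearest neighbours are all 0."""
    rows, cols = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                neighbours = [mask[rr][cc]
                              for rr, cc in ((r - 1, c), (r + 1, c),
                                             (r, c - 1), (r, c + 1))
                              if 0 <= rr < rows and 0 <= cc < cols]
                if not any(neighbours):
                    out[r][c] = 0
    return out


speckled = [[1, 0, 0, 0],
            [0, 0, 1, 1],
            [0, 0, 0, 0]]
for row in remove_isolated(speckled):
    print(row)
```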

Next, apply the morphological close operation (dilate operation followed
by erode operation) to fill in between close-by 1s; that is, replace the speckled
areas of Figure 4c with solid areas. Use dilate and erode operations with a
circular kernel of radius K pixels (K may be 11 for QCIF frames and 13 for CIF
frames); in particular, the dilate operation replaces a 0 pixel with a 1 if any
other pixel within K pixels of the original 0 pixel is a 1 pixel, and the erode
operation replaces a 1 pixel with a 0 unless all pixels within K pixels of the
original 1 pixel are all also 1 pixels. After the close operation, apply the open
operation (erode operation followed by dilate operation) to remove small
isolated areas of 1s. This yields a set of connected components (regions) of 1
pixels with fairly smooth boundaries as illustrated in Figure 4d. Note that a
connected component may have one or more interior holes which also provide
boundary contours.
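
The close and open operations with a circular kernel can be sketched as
follows. The sketch is a direct, unoptimized transcription of the dilate and
erode definitions above; the function names are illustrative, and the choice
to ignore kernel positions that fall outside the frame during erosion is an
assumption.

```python
def disc_offsets(k):
    """Offsets (dr, dc) lying within a circular kernel of radius k pixels."""
    return [(dr, dc) for dr in range(-k, k + 1) for dc in range(-k, k + 1)
            if dr * dr + dc * dc <= k * k]


def dilate(mask, k):
    """A pixel becomes 1 if any in-frame pixel within radius k is 1."""
    rows, cols = len(mask), len(mask[0])
    offsets = disc_offsets(k)
    return [[1 if any(0 <= r + dr < rows and 0 <= c + dc < cols
                      and mask[r + dr][c + dc] for dr, dc in offsets) else 0
             for c in range(cols)] for r in range(rows)]


def erode(mask, k):
    """A pixel stays 1 only if all in-frame pixels within radius k are 1."""
    rows, cols = len(mask), len(mask[0])
    offsets = disc_offsets(k)
    return [[1 if all(mask[r + dr][c + dc] for dr, dc in offsets
                      if 0 <= r + dr < rows and 0 <= c + dc < cols) else 0
             for c in range(cols)] for r in range(rows)]


def close_mask(mask, k):
    """Dilate followed by erode: fills gaps between close-by 1s."""
    return erode(dilate(mask, k), k)


def open_mask(mask, k):
    """Erode followed by dilate: removes small isolated areas of 1s."""
    return dilate(erode(mask, k), k)


print(close_mask([[1, 0, 1, 0, 0, 0, 0]], 1))
print(open_mask([[0, 1, 0, 0, 1, 1, 1]], 1))
```

In the one-row examples the close operation fills the single-pixel gap and
the open operation removes the lone 1 while keeping the longer run.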

Then raster scan the binary image to detect and label connected regions
and their boundary contours (a pixel which is a 1 and has at least one nearest
neighbor pixel which is a 0 is deemed a boundary contour pixel). A procedure
such as ccomp (see Ballard reference or the Appendix) can accomplish this.
Each of these regions presumptively indicates one or more moving objects plus
background uncovered by the motion. Small regions can be disregarded by
using a threshold such as a minimum difference between extreme boundary
pixel coordinates. Such small regions may grow in succeeding frames and
eventually arise in the motion failure regions of a later frame. Of course, a
connected region cannot be smaller than the K-pixel-radius dilate/erode kernel,
otherwise it would not have survived the open operation.
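
A raster-scan labelling of this kind can be sketched as below. This is not
the ccomp procedure of the Ballard reference or the Appendix but a generic
two-pass labelling with merging of groups; 4-connectivity, the treatment of
out-of-frame neighbors as 0s, and all function names are assumptions of the
sketch.

```python
def label_regions(mask):
    """Two-pass raster-scan labelling of 1 pixels (4-connectivity)."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]                       # parent[i]: group link of provisional label i

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if not up and not left:
                parent.append(len(parent))         # start a new group
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = up or left
                if up and left:
                    parent[find(left)] = find(up)  # the two groups merge
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels


def is_boundary(mask, r, c):
    """A 1 pixel with at least one 0 among its four nearest neighbours."""
    if not mask[r][c]:
        return False
    rows, cols = len(mask), len(mask[0])
    return any(not (0 <= rr < rows and 0 <= cc < cols) or not mask[rr][cc]
               for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))


def region_extents(labels):
    """Map each label to (min_row, min_col, max_row, max_col)."""
    extents = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab:
                r0, c0, r1, c1 = extents.get(lab, (r, c, r, c))
                extents[lab] = (min(r0, r), min(c0, c), max(r1, r), max(c1, c))
    return extents


def large_regions(extents, min_span):
    """Keep regions whose extreme coordinates differ by at least min_span."""
    return {lab: box for lab, box in extents.items()
            if max(box[2] - box[0], box[3] - box[1]) >= min_span}


mask = [[1, 1, 0, 0, 0],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 0, 1],
        [1, 0, 0, 0, 1]]
extents = region_extents(label_regions(mask))
print(sorted(extents.values()))
print(sorted(large_regions(extents, 1).values()))
```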

Contour Representation 

The preferred embodiments have an option of boundary contour encoding
by either spline approximation or blocks straddling the contour; this permits
a choice of either high resolution or low resolution and thus provides a
scalability. The boundary contour encoding with the block representation takes
fewer bits but is less accurate than the spline representation. Thus a tradeoff
exists which may be selected according to the application.

(i) Block boundary contour representation.

For each of the connected regions in the binary image derived from F_N
in the preceding section, find the bounding rectangle for the region by finding
the smallest and largest boundary pixel x coordinates and y coordinates: the
smallest x coordinate (x0) and the smallest y coordinate (y0) define the lower
lefthand rectangle corner (x0,y0) and the largest coordinates define the upper
righthand corner (x1,y1); see Figure 5a showing a connected region and Figure
5b the region plus the bounding rectangle.

Next, tile the rectangle with 16 by 16 macroblocks starting at (x0,y0) and
with the macroblocks extending past the upper and/or righthand edges if the
rectangle's sides are not multiples of 16 pixels; see Figure 5c illustrating a
tiling. If the tiling would extend outside of the frame, then translate the corner
(x0,y0) to just keep the tiling within the frame.

Form a bit map with a 1 representing the tiling macroblocks that have
at least 50 of their 256 pixels (i.e., at least about 20%) on the boundary or
inside the region and a 0 for macroblocks that do not. This provides the block
description of the boundary contour: the starting corner (x0,y0) and the bit map.
See Figure 5d showing the bit map.
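
The bounding rectangle, tiling, and bit map construction can be sketched as
follows. The sketch assumes a region given as a set of (x, y) pixels on or
inside its boundary contour and a bit map whose rows run in order of
increasing y; the function and parameter names are illustrative.

```python
def block_bitmap(region, frame_rows, frame_cols, size=16, min_pixels=50):
    """Return ((x0, y0), bitmap) for a region given as a set of (x, y) pixels.

    bitmap[j][i] is 1 when the macroblock in tiling row j and column i holds
    at least min_pixels pixels of the region.
    """
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    x0, y0 = min(xs), min(ys)
    nx = (max(xs) - x0) // size + 1    # macroblocks per tiling row
    ny = (max(ys) - y0) // size + 1    # tiling rows
    # Translate the corner just enough to keep the tiling within the frame.
    x0 = max(0, min(x0, frame_cols - nx * size))
    y0 = max(0, min(y0, frame_rows - ny * size))
    counts = [[0] * nx for _ in range(ny)]
    for x, y in region:
        counts[(y - y0) // size][(x - x0) // size] += 1
    bitmap = [[1 if n >= min_pixels else 0 for n in row] for row in counts]
    return (x0, y0), bitmap


# A 20 by 20 pixel square region in a frame of 144 rows of 176 pixels.
square = {(x, y) for x in range(10, 30) for y in range(10, 30)}
print(block_bitmap(square, 144, 176))
```

For the square, three of the four tiling macroblocks hold at least 50 region
pixels and the fourth, covering only a 4 by 4 corner, does not.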

The corner plus bit map information will be transmitted if the region is
small; that is, if at most 3 or 4 macroblocks tile the bounding rectangle. In case
the region is larger, a more efficient coding proceeds as follows. First, compare
the bit map with the bit maps of the previous frame; typically the previous
frame has only 3 or 4 bit maps. If a bit map match is found, then compare the
associated corner, (x'0,y'0), of the previous frame's bit map with (x0,y0). Then
if (x'0,y'0) equals (x0,y0), a bit indicating the corner and bit map matching those
of the previous frame can be transmitted instead of the full bit map and corner.
Figure 5d suggests this single bit contour transmission.

Similarly, if a bit map match is found with a bit map of the previous
frame but the associated corner (x'0,y'0) does not equal (x0,y0), then transmit a
translation vector [(x'0,y'0)-(x0,y0)] instead of the full bit map and corner. This
translation vector typically will be fairly small because objects do not move too
much frame-to-frame. See Figure 5e.

Further, if a bit map match is not found, but the bit map difference is not
large, such as only 4 or 5 macroblock differences, both added and removed, then
transmit the locations of the changed macroblocks plus any translation vector
of the associated rectangle corners, (x'0,y'0)-(x0,y0). See Figure 5f.

Lastly, for a large difference in macroblocks, just transmit the corner
(x0,y0) and run length encode the bit map along rows of macroblocks in the
bounding rectangle as illustrated in Figure 5g for transmission. Note that
large-enough holes within the region plus projections can give rise to multiple
runs in a row.
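
The choice among these contour codings can be sketched as below. The mode
names, the tuple layout of the result, the default limits, and the rule that
only previous bit maps of the same dimensions are compared are assumptions of
the sketch; identifying which previous bit map matched, and the actual bit
level encoding, are left out.

```python
def run_lengths(row):
    """Run length encode one row of a bit map as (bit, length) pairs."""
    runs = []
    for bit in row:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(bit, length) for bit, length in runs]


def differences(a, b):
    """Positions where two bit maps differ, or None if their sizes differ."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return None
    return [(r, c) for r in range(len(a)) for c in range(len(a[0]))
            if a[r][c] != b[r][c]]


def code_contour(corner, bitmap, previous, small=4, max_changes=5):
    """previous: list of (corner, bitmap) pairs from the previous frame."""
    if len(bitmap) * len(bitmap[0]) <= small:
        return ("raw", corner, bitmap)            # small region: send as is
    candidates = []
    for prev_corner, prev_map in previous:
        changed = differences(bitmap, prev_map)
        if changed is not None:
            shift = (prev_corner[0] - corner[0], prev_corner[1] - corner[1])
            candidates.append((len(changed), shift, changed))
    if candidates:
        count, shift, changed = min(candidates, key=lambda item: item[0])
        if count == 0:
            return ("same",) if shift == (0, 0) else ("translated", shift)
        if count <= max_changes:
            return ("changed", shift, changed)
    return ("full", corner, [run_lengths(row) for row in bitmap])


bitmap = [[1, 1, 1],
          [1, 0, 1]]
previous = [((16, 16), bitmap)]
print(code_contour((16, 16), bitmap, previous))
print(code_contour((18, 16), bitmap, previous))
print(code_contour((0, 0), bitmap, []))
```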

(ii) Spline boundary contour representation: 

For each connected region derived in the preceding section find corner
points of the boundary contour(s), including of any interior holes, of the region.
Note that a region of size roughly 50 pixels in diameter will have very roughly
200-300 pixels in its boundary contour, so use about 20% of the pixels in a
contour representation. A Catmull-Rom spline (see the Foley reference or the
Appendix) fit to the corner points approximates the boundary.
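
A minimal sketch of evaluating a closed Catmull-Rom spline through a list of
corner points follows. It uses the standard uniform Catmull-Rom basis, which
passes through its control points; the function names, the number of samples
per segment, and the treatment of the contour as a closed loop of corner
points are assumptions of the sketch, and the Appendix may differ in detail.

```python
def catmull_rom_point(p0, p1, p2, p3, t):
    """Point at parameter t in [0, 1] on the spline segment from p1 to p2."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3))


def closed_contour(corners, samples_per_segment=8):
    """Sample a closed spline passing through every corner point in order."""
    n = len(corners)
    points = []
    for i in range(n):
        p0, p1, p2, p3 = (corners[(i + j - 1) % n] for j in range(4))
        for s in range(samples_per_segment):
            points.append(catmull_rom_point(p0, p1, p2, p3,
                                            s / samples_per_segment))
    return points


corners = [(0, 0), (10, 0), (10, 10), (0, 10)]
contour = closed_contour(corners, samples_per_segment=4)
print(contour[0], contour[2], contour[4])
```

The first and fifth samples coincide with corner points, while the sample
midway between them bulges slightly outside the square, as expected for a
smooth curve through four corners.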

Motion Estimation

For each connected region and bit map derived from F_N in the preceding
section, estimate the motion vector(s) of the region as follows. First, for each
16 by 16 macroblock in F_N which corresponds to a macroblock indicated by the
bit map to be within the region, compare this macroblock with macroblocks in
the previous reconstructed frame, F_N-1, which are translates of up to 15 pixels
(the search area) of this macroblock in F_N. The comparison is the sum of the
absolute differences in the pixel intensities of the selected macroblock in F_N
and the compared macroblock in F_N-1 with the sum over the 256 pixels of the
macroblock. The search is performed at a sub-pixel resolution (half pixel with
interpolation for comparison) to get a good match and extends 15 pixels in all
directions. The motion vector corresponding to the translation of the selected
macroblock of F_N to the F_N-1 macroblock(s) with minimum sum differences can
then be taken as an estimate of the motion of the selected macroblock. Note
that use of the same macroblock locations as in the bit map eliminates the need
to transmit an additional starting location. See Figure 6 indicating a motion
vector.
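
The block matching search can be sketched as below. The sketch performs the
full search at integer-pixel accuracy only, so the half-pixel interpolation
is omitted; skipping candidate blocks that extend outside the frame, the tie
rule (first minimum in scan order), and the function names are assumptions.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute intensity differences between two size-by-size blocks."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def best_motion_vector(cur, ref, cx, cy, size=16, search=15):
    """Return ((dx, dy), sad) minimizing the SAD over the search area.

    (cx, cy) is the upper-left pixel of the block in the current frame and
    (dx, dy) the translation to the matching block in the reference frame.
    """
    rows, cols = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= cols - size and 0 <= ry <= rows - size:
                cost = sad(cur, ref, cx, cy, rx, ry, size)
                if best is None or cost < best[1]:
                    best = ((dx, dy), cost)
    return best


# Small synthetic frames: the current frame is the reference frame shifted.
ref = [[7 * x + 13 * y for x in range(12)] for y in range(12)]
cur = [[ref[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)]
       for y in range(12)]
print(best_motion_vector(cur, ref, 4, 4, size=4, search=2))
```

The returned minimum sum of differences is the quantity compared against a
threshold when deciding whether a motion vector represents the macroblock
adequately.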

If the minimum sum differences defining the motion vector is above a
threshold, then none of the macroblocks searched in F_N-1 sufficiently matches
the selected macroblock in F_N and so do not use the motion vector
representation. Rather, simply encode the selected macroblock as an I block
(intraframe encoded in its entirety) and not as a P block (predicted as a
translation of a block of the previous frame).

Next, for each macroblock having a motion vector, subdivide the
macroblock into four 8 by 8 blocks in F_N and repeat the comparisons with
translates of 8 by 8 blocks of F_N-1 to find a motion vector for each 8 by 8 block.
If the total number of code bits needed for the four motion vectors of the 8 by
8 blocks is less than the number of code bits for the motion vector of the 16 by 16
macroblock and if the weighted error with the use of four motion vectors is
lower compared to the single macroblock motion vector, then use the 8 by 8 block
motion vectors.

Average the motion vectors over all macroblocks in F_N which are within
the region to find an average motion vector for the entire region. Then if none
of the macroblock motion vectors differs from the average motion vector by
more than a threshold, only the average motion vector need be transmitted.
Also, the average motion vector can be used in error recovery as noted in the
following Error Concealment section.
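
A minimal sketch of this averaging test follows. The function names are
hypothetical, and measuring the deviation as the larger of the horizontal and
vertical component differences is an assumption of the sketch; the text only
requires that no vector differ from the average by more than a threshold.

```python
def average_motion_vector(vectors):
    """Component-wise mean of a non-empty list of (x, y) motion vectors."""
    count = len(vectors)
    return (sum(v[0] for v in vectors) / count,
            sum(v[1] for v in vectors) / count)


def can_send_average_only(vectors, threshold):
    """True if every vector is within `threshold` of the average vector.

    Deviation is taken as the larger component difference (an assumption).
    """
    avg_x, avg_y = average_motion_vector(vectors)
    return all(max(abs(vx - avg_x), abs(vy - avg_y)) <= threshold
               for vx, vy in vectors)
```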

Thus for each connected region found in F_N by the foregoing
segmentation section, transmit the motion vector(s) plus bit map. Typically,
teleconferencing with 176 by 144 pixel frames will require 100-150 bits to
encode the shapes of the expected 2 to 4 connected regions plus 200-300 bits for
the motion vectors.

Also, the optional 8 by 8 or 16 by 16 motion vectors and overlapped 
motion compensation techniques may be used. 

Motion Failure Region Detection

An approximation to F_N can be synthesized from reconstructed F_{N-1} by
use of the motion vectors plus corresponding (macro)blocks from F_{N-1} as found
in the preceding section: for a pixel in the portion of F_N lying outside of the
difference regions found in the Segmentation section, just use the value of the
corresponding pixel in F_{N-1}, and for a pixel in a connected region, use the
value of the corresponding pixel in the macroblock in F_{N-1} which the motion
vector translates to the macroblock in F_N containing the pixel. The pixels in
F_N with intensities which differ by more than a threshold from the intensity of
the corresponding pixel in the approximation synthesized by use of the motion
vectors plus corresponding (macro)blocks from F_{N-1} represent a motion
compensation failure region. To handle this motion failure region, the intensity
differences are thresholded, next median filtered, and subjected to the
morphological close and open operations in the same manner as the differences
from F_{N-1} to F_N described in the foregoing object detection and segmentation
section. Note that the motion failure regions will lie inside of moving object
regions; see Figure 7 as an illustration.

If a spline boundary contour was used, then only consider the portion of
a macroblock inside the boundary contour.
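
The thresholding and median filtering of the prediction differences can be
sketched as follows. This is an illustrative outline only: the function names
are hypothetical, the 3 by 3 window and the unchanged border pixels are
assumptions, and the morphological close and open operations are not shown.

```python
def motion_failure_mask(actual, predicted, threshold):
    """Mark pixels whose intensity differs from the motion compensated
    prediction by more than `threshold`."""
    return [[1 if abs(a - p) > threshold else 0
             for a, p in zip(actual_row, predicted_row)]
            for actual_row, predicted_row in zip(actual, predicted)]


def median3x3(mask):
    """3 x 3 median filter of a binary mask (majority of the 9 pixels).
    Border pixels are copied unchanged, which is an assumption."""
    height, width = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            ones = sum(mask[y + dy][x + dx]
                       for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = 1 if ones >= 5 else 0
    return out
```

The median filter removes isolated above-threshold pixels, so that only
coherent areas remain as candidate motion failure regions.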

Residual Signal Encoding 

Encode the motion failure regions as follows: tile these motion failure
regions with the 16 by 16 macroblocks of the bit map of the foregoing boundary
contour section; this eliminates the need to transmit a starting pixel for the
tiling because it is the same as for the bit map. This also means that the tiling
moves with the object and thus may lessen the changes.

For the motion failure regions, in each macroblock simply apply DCT
with quantization of coefficients and runlength encoding and then Huffman
encoding. See Figure 8 showing the macroblocks within the grid.

A preferred embodiment motion failure region encoding uses wavelets
instead of DCT or DPCM. In particular, a preferred embodiment uses a
wavelet transform on the macroblocks of the motion failure region as illustrated
in Figure 8. Recall that a wavelet transform is traditionally a full frame
transform based on translations and dilations of a mother wavelet, Ψ(), and a
mother scaling function, Φ(); both Ψ() and Φ() are essentially nonzero for only
a few adjacent pixels, depending upon the particular mother wavelet. Then
basis functions for a wavelet transform in one dimension are the
Ψ_{n,m}(t) = 2^{-m/2} Ψ(2^{-m}t - n) for integers n and m. Ψ() and Φ() are
chosen to make the translations and dilations orthogonal analogous to the
orthogonality of the sin(kt) and cos(kt) so a transform can be easily computed
by integration (summation for the discrete case). The two dimensional transform
simply uses basis functions as the products of Ψ_{n,m}()s in each dimension.
Note that the index n denotes translations and the index m denotes dilations.
Compression arises from quantization of the transformation coefficients
analogous to compression with DCT. See for example, Antonini et al, Image
Coding Using Wavelet Transform, 1 IEEE Tran. Image Proc. 205 (1992) and Mallat,
A Theory for Multiresolution Signal Decomposition: The Wavelet Representation,
11 IEEE Tran. Patt. Anal. Mach. Intel. 674 (1989) for discussion of wavelet
transformations. For discrete variables the wavelet transformation may also
be viewed as subband filtering: the filter outputs are the reconstructions from
sets of transform coefficients. Wavelet transformations proceed by successive
stages of decomposition of an image through filtering into four subbands:
lowpass horizontally with lowpass vertically, highpass horizontally with
lowpass vertically, lowpass horizontally with highpass vertically, and highpass
both horizontally and vertically. In the first stage the highpass filtering is
convolution with the translates Ψ_{n,1} and the lowpass is convolution with the
scaling function translates Φ_{n,1}. At the second stage the output of the first
stage subband of lowpass in both horizontal and vertical is again filtered into
four subbands but with highpass filtering now convolution with Ψ_{n,2}, which in
a sense has half the frequency of Ψ_{n,1}; similarly, the lowpass filtering is
convolution with Φ_{n,2}. Figures 9a-b illustrate the four subband filterings
with recognition that each filtered image can be subsampled by a factor of 2 in
each direction, so the four output images have the same number of pixels as the
original input image. The preferred embodiments may use biorthogonal
wavelets which provide filters with linear phase. The biorthogonal wavelets
are similar to the orthogonal wavelets described above but use two related
mother wavelets and mother scaling functions (for the decomposition and
reconstruction stages). See for example, Villasenor et al, Filter Evaluation and
Selection in Wavelet Image Compression, IEEE Proceedings of Data
Compression Conference, Snowbird, Utah (1994) which provides several
examples of good biorthogonal wavelets. The preferred embodiment may use
the (6,2) tap filter pair from the Villasenor paper which has low pass filter
coefficients of 0.707107 and 0.707107 (the two-tap filter) and g_0 = -0.088388,
g_1 = 0.088388, g_2 = 0.707107, g_3 = 0.707107, g_4 = 0.088388, g_5 = -0.088388
(the six-tap filter) for the analysis and synthesis filters.

Preferred embodiment wavelet transforms generally selectively code
information in only regions of interest in an image by coding only the regions
in the subbands at each stage which correspond to the original regions of
interest in the original image. See Figures 10a-c heuristically illustrating how
regions appear in the subband filtered outputs. This approach avoids spending
bits outside of the regions of interest and improves video quality. The specific
use for motion failure regions is a special case of only encoding regions of
interest. Note that the thesis of H. J. Barnard ("Image and Video Coding Using
a Wavelet Decomposition", Technische Universiteit Delft, 1994) segments an
image into relatively homogeneous regions and then uses different wavelet
transforms to code each region, and only considered single images, not video
sequences. Barnard's method also requires the wavelet transformation be
modified for each region shape; this adds complexity to the filtering stage and
the coding stage. The preferred embodiments use a single filtering transform.
Further, the preferred embodiment applies to regions of interest, not just
homogeneous regions as in Barnard which fill up the entire frame.

The preferred embodiments represent regions of interest with an image
map. The map represents which pixels in a given image lie within the regions
of interest. The simplest form is a binary map representing to be coded or not
to be coded. If more than two values are used in the map, then varying
priorities can be given to different regions. This map must also be transmitted
to the decoder as side information. For efficiency, the map information can be
combined with other side information such as motion compensation.

The map is used during quantization. Since the wavelets decompose the
image into subbands, the first step is to transfer the map to the subband
structure (that is, determine which locations in the subband output images
correspond to the original map). This produces a set of subregions in the
subbands to be coded. Figures 10a-c show the subregions: Figure 10a shows
the original image map with the regions of interest shown, and Figure 10b
shows the four subband outputs with the corresponding regions of interest to
be coded after one stage of decomposition. Figure 10c shows the subband
structure after two stages and with the regions of interest.
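
One way to picture the transfer of the map to the subband structure is the
following sketch, which halves a binary map once per decomposition stage. The
function names are hypothetical, the map dimensions are assumed to be divisible
by 2 at every stage, and marking a subband location when any of its four source
locations is marked is an assumption of this sketch rather than a rule stated
in the text.

```python
def downsample_map(roi_map):
    """Halve a binary map in each direction. A subband location is kept
    if any pixel of the corresponding 2 x 2 source block is of interest."""
    height, width = len(roi_map), len(roi_map[0])
    return [[max(roi_map[2 * y][2 * x], roi_map[2 * y][2 * x + 1],
                 roi_map[2 * y + 1][2 * x], roi_map[2 * y + 1][2 * x + 1])
             for x in range(width // 2)]
            for y in range(height // 2)]


def subband_maps(roi_map, stages):
    """Return one map per decomposition stage. The map for a stage applies
    to each of the subbands produced at that stage."""
    maps = []
    current = roi_map
    for _ in range(stages):
        current = downsample_map(current)
        maps.append(current)
    return maps
```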

The preferred embodiment first sets the pixels outside of the regions of
interest to 0 and then applies the wavelet decomposition (subband filtering
stages). After decomposition and during the quantization of the wavelet
transform coefficients, the encoder only sends information about values that lie
within the subregions of interest to be coded. The quantization of coefficients
provides compression analogous to DCT transform coefficient quantization.
Experiments show that the video quality increases with compression using the
regions of interest approach as compared to not using it.

There is some slight sacrifice made in representing the values near the
edges of the selected regions of interest because the wavelet filtering process
will smear the information somewhat and any information that smears outside
the region of interest boundary is lost. This means that there is no guarantee
of perfect reconstruction for values inside the region of interest even if the
values in the regions of interest were perfectly coded. In practice, this does not
seem to be a severe hardship because the level of quantization required for
typical compression applications means that the images are far from any
perfect reconstruction levels anyway and the small effect near the edges can be
ignored for all practical purposes.

The preferred embodiments may use the zerotree quantization method
for the transform coefficients. See Shapiro, Embedded Image Coding Using
Zerotrees of Wavelet Coefficients, 41 IEEE Trans. Sig. Proc. 3445 (1993) for
details of the zerotree method applied to single images. The zerotree method
implies that only the zerotrees that lie within the subregions of interest are
coded. Of course, other quantization methods could be used instead of zerotree.
Figure 11 illustrates the zerotree relations.

In applications the regions of interest can be selected in many ways, such
as areas that contain large numbers of errors (such as quantizing video after
motion compensation) or areas corresponding to perceptually important image
features (such as faces) or objects for scalable compression. Having the ability
to select regions is especially useful in motion compensated video coding where
the residual images being quantized typically contain information concentrated
in areas of motion rather than uniformly spread over the frame.

Regions of interest can be selected as macroblocks which have errors that 
exceed a threshold after motion compensation. This application essentially 
combines region of interest map information with motion compensation 
information. Further, the regions of interest could be macroblocks covering 
objects and their motion failure regions as described in the foregoing. 

Figure 12 illustrates a video compressor using the wavelet transform on 
regions of interest. 

An alternative preferred embodiment uses a wavelet transform on the 
motion failure region macroblocks and these may be aligned with the 
rectangular grid. 

(1) Initially, encode the zeroth frame F_0 as an I picture. Compute the
multi-level decomposition of the entire frame; quantize and encode the resulting
wavelet coefficients, and transmit. The preferred embodiment uses the zerotree
method of quantization and encoding. Any subsequent frame F_N that is to be
an I picture can be encoded in the same manner.

(2) For each frame encoded as a P picture (not an I picture), perform
motion compensation on the input frame by comparing the pixel values in the
frame with pixel values in the previous reconstructed frame. The resulting
predicted frame is subtracted from the input frame to produce a residual image
(difference between predicted and actual pixel values). The motion compensation
can be done using the segmentation approach described earlier or simply on a
block by block basis (as in H.263). The resulting motion vector information is
coded and transmitted.

(3) For each residual image computed in step (2), determine the region
or regions of interest that require additional information to be sent. This can
be done using the motion failure approach described earlier or simply on a
macroblock basis by comparing the sum of the squared residual values in a
macroblock to a threshold and including only those macroblocks above the
threshold in the region of interest. This step produces a region of interest map.
This map is coded and transmitted. Because the map information is correlated
with the motion vector information in step (2), an alternative preferred
embodiment codes and transmits the motion vector and map information
together to reduce the number of bits required.

(4) Using the residual image computed in step (2) and the region of
interest map produced in step (3), values in the residual images that correspond
to locations outside the region of interest map can be set to zero. This ensures
that values outside the region of interest will not affect values within the region
of interest after wavelet decomposition. Step (4) is optional and may not be
appropriate if the region based wavelet approach is applied to something
besides motion compensated residuals.

(5) The traditional multi-level wavelet decomposition is applied to the
image computed in step (4). The number of filtering operations can be reduced
(at the cost of more complexity) by performing the filtering only within the
regions of interest. However, because of the zeroing from step (4), the same
results will be obtained by performing the filtering on the entire image which
simplifies the filtering stage.

(6) The decomposed image produced in step (5) is next quantized and
encoded. The region of interest map is used to specify which corresponding
wavelet coefficients in the decomposed subbands are to be considered. Figure
10 shows how the region of interest map is used to indicate which subregions
in the subbands are to be coded. Next, all coefficients within the subregions of
interest are quantized and encoded. The preferred embodiment uses a
modification of the zerotree approach by Shapiro, which combines correlation
between subbands, scalar quantization and arithmetic coding. The zerotree
approach is applied to those coefficients within the subregions of interest.
Other quantization and coding approaches could also be used if modified to only
code coefficients within the subregions of interest. The output bits of the
quantization and encoding step are then transmitted. The resulting quantized
decomposed image is used in step (7).

(7) The traditional multi-level wavelet reconstruction is applied to the
quantized decomposed image from step (6). The number of filtering operations
can be reduced (at the cost of more complexity) by performing the filtering only
within the regions of interest. However, because of the zeroing from step (4),
the same results will be obtained by performing the filtering on the entire
image which simplifies the filtering stage.

(8) As in step (4), the reconstructed residual image computed in step (7)
and the region of interest map produced in step (3) can be used to zero values
in the reconstructed residual image that correspond to locations outside the
region of interest map. This ensures that values outside the region of interest
will not be modified when the reconstructed residual is added to the predicted
image. Step (8) is optional and may not be appropriate if the region based
wavelet approach is applied to something besides motion compensated
residuals.

(9) The resulting residual image from step (8) is added to the predicted
frame from step (2) to produce the reconstructed frame (this is what the
decoder will decode). The reconstructed frame is stored in a frame memory to
be used for motion compensation for the next frame.
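
The macroblock-based selection of step (3) and the zeroing of step (4) can be
sketched as below. The function names and the list-of-rows residual layout are
hypothetical, and the wavelet decomposition, quantization and reconstruction of
steps (5) through (7) are not shown.

```python
def roi_macroblocks(residual, threshold, size=16):
    """Step (3), macroblock variant: one map entry per macroblock, set to 1
    when the sum of squared residual values exceeds `threshold`."""
    height, width = len(residual), len(residual[0])
    roi = []
    for by in range(0, height, size):
        row = []
        for bx in range(0, width, size):
            energy = sum(residual[y][x] ** 2
                         for y in range(by, min(by + size, height))
                         for x in range(bx, min(bx + size, width)))
            row.append(1 if energy > threshold else 0)
        roi.append(row)
    return roi


def zero_outside_roi(residual, roi, size=16):
    """Step (4): set residual values outside the region of interest to 0."""
    return [[value if roi[y // size][x // size] else 0
             for x, value in enumerate(row)]
            for y, row in enumerate(residual)]
```

The same zeroing function could be reused for step (8) on the reconstructed
residual image.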

More generally, subband filtering of other types such as QMF and
Johnston could be used in place of the wavelet filtering provided that the region
of interest based approach is maintained.

Scalability

The object oriented approach of the preferred embodiments permits
scalability. Scalable compression refers to the construction of a compressed
video bit stream that can have a subset of the encoded information removed, for
example all of the objects representing a particular person, and the remaining
bitstream will still decode correctly, that is, without the removed person, as if
the person were never in the video scenes. The removal must occur without
decoding or recoding any objects. Note that the objects may be of different
types, such as "enhancement" objects, whose loss would not remove the object
from the scene, but rather just lower the quality of its visual appearance or
omit audio or other data linked to the object.

The preferred embodiment scalable object-based video coding proceeds as 
follows: 

Presume an input video sequence of frames together with a segmentation
mask for each frame; the mask delineates which pixels belong to which objects.
Such a mask can be developed by difference regions together with inverse
motion vectors for determining uncovered background plus tracking through
frames of the connected regions, including mergers and separations, of the
mask for object identification. See the background references. The frames are
coded as I frames and P frames with the initial frame being an I frame, and
other I frames may occur at regular or irregular intervals thereafter. The
intervening frames are P frames and rely on prediction from the closest
preceding I frame. For an I frame define the "I objects" as the objects the
segmentation mask identifies; the I-objects are not just in the I frames but may
persist into the P frames. Figures 13a-b illustrate a first frame plus its
segmentation mask.

Encode an I frame by first forming an inverse image of the
segmentation mask. Then this image is blocked (covered with a minimal
number of 16 by 16 macroblocks aligned on a grid), and the blocked image is
used as a mask to extract the background image from the frame. See Figures
13c-d illustrating the blocked image and the extracted background.

Next, the blocked mask is efficiently encoded, such as by the differential
contour encoding of the foregoing description. These mask bits are put into the
output bitstream as part of object #0 (the background object).

Then the extracted background is efficiently encoded, such as by DCT 
encoded 16 by 16 macroblocks as in the foregoing. These bits are put into the 
output bitstream as part of object #0. 

Further, for each object in the frame, the segmentation mask for that
object is blocked and encoded, and that object extracted from the first frame via
the blocked mask and encoded, as was done for the background image. See
Figures 13e-f illustrating the blocked object mask and extracted object. The
blocked mask and extracted object are encoded in the same manner as the
background and the bits put into the output bitstream.

As each object is put into the bitstream it is preceded by a header of fixed
length wherein the object number, object type (such as I-object) and object
length (in bits) are recorded.

After all of the objects have been coded, a reconstructed frame is made,
combining decoded images of the background and each object into one frame.
This reconstructed frame is the same frame that will be produced by the
decoder if it decodes all of the objects. Note that overlapping macroblocks (from
different objects) will be the same, so the reconstruction will not be ambiguous.
See Figures 13g-i illustrating the reconstructed background and objects and
frame.

An average frame is calculated from the reconstructed frame. An
average pixel value is calculated for each channel (e.g., luminance, blue, and
red) in the reconstructed frame and those pixel values are replicated in their
channels to create the average frame. The three average pixel values are
written to the output bitstream. This completes the I frame encoding.

Following the I frame, each subsequent frame of the video sequence is
encoded as a P frame until the next, if any, I frame. The "P" stands for
"predicted" and refers to the fact that the P frame is predicted from the frame
preceding it (I frames are coded only with respect to themselves). Note that
there is no requirement in the encoder that every frame of the input is encoded;
for example, every third frame of a 30 Hz sequence could be coded to produce a
10 Hz sequence.

As with the I frame, for a P frame block the segmentation mask for each
object and extract the object. See Figures 13j-m showing a P frame, an object
mask, the blocked object mask, and the extracted object, respectively. Do not
use object #0 (the background) because it should not be changing and should
not need prediction.

Next, each of the extracted objects is differenced with its reconstructed
version in the previous frame. The block mask is then adjusted to reflect any
holes that might have opened up in the differenced image; that is, the
reconstructed object may closely match a portion of the object so the difference
may be below threshold in an area within the segmentation mask, and this part
need not be separately encoded. See Figures 13n-o showing the object
difference and the adjusted block mask, respectively. Then the block mask is
efficiently encoded and put into the output bitstream.

To have a truly object-scalable bitstream the motion vectors
corresponding to the blocks tiling each of the objects should only point to
locations within the previous position of this object. Hence in forming this
bitstream, for each of the objects to be coded in the current image, the encoder
forms a separate reconstructed image with only the reconstructed version of
this object in the previous frame and all other objects and background removed.
The motion vectors for the current object are estimated with respect to this
image. Before performing the motion estimation, all the other areas of the
reconstructed image where the object is not defined (non mask areas) are filled
with an average background value to get a good motion estimation at the block
boundaries. This average value can be different for each of the objects and can
be transmitted in the bitstream for use by the decoder. Figure 13p shows an
image of a reconstructed object with the average value in the non mask areas.
This is the image used for motion estimation. The calculated motion vectors are
then efficiently encoded and put in the bitstream.

Then the differences between the motion compensated object and the
current object are DCT (or wavelet) encoded on a macroblock basis. If the
differences do not meet a threshold, then they are not coded, down to an 8 by
8 pixel granularity. Also, during motion estimation, some blocks could be
designated INTRA blocks (as in an I frame and as opposed to INTER blocks for
P frames) if the motion estimation calculated that it could not do a good job on
that block. INTRA blocks do not have motion vectors, and their DCT coding
is only with respect to the current block, not a difference with a compensated
object block. See Figures 13q-r illustrating the blocks which were DCT coded
(INTRA blocks).

Next, the uncovered background that the object's motion created (with
respect to the object's position in the previous frame) is calculated and coded
as a separate object for the bitstream. This separate treatment of the
uncovered background (along with the per object motion compensation) is what
makes the bitstream scalable (for video objects). The bitstream can be played
as created; the object and its uncovered background can be removed to excise
the object from the playback, or just the object can be extracted to play on its
own or to be added to a different bitstream.

To calculate the uncovered background, the object's original (not blocked)
segmentation masks are differenced such that all of the pixels in the previous
mask belonging to the current mask are removed. The resulting image is then
blocked and the blocks used as a mask to extract the uncovered background
from the current image. See Figures 13s-u illustrating the uncovered
background pixels, a block mask for the pixels and the image within the mask.
The uncovered background image is DCT encoded as INTRA blocks
(making the uncovered background objects I objects). See Figure 13v for the
reconstructed frame.
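
The mask differencing and blocking described above for the uncovered
background can be illustrated with the following sketch. The function names and
the binary list-of-rows mask layout are hypothetical; the extraction of pixel
values through the blocked mask and the DCT encoding are not shown.

```python
def uncovered_pixels(prev_mask, cur_mask):
    """Pixels of the previous object mask that are not in the current mask."""
    return [[1 if p and not c else 0 for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(prev_mask, cur_mask)]


def block_mask(mask, size=16):
    """Cover the marked pixels with grid-aligned size x size blocks."""
    height, width = len(mask), len(mask[0])
    blocked = [[0] * width for _ in range(height)]
    for by in range(0, height, size):
        for bx in range(0, width, size):
            ys = range(by, min(by + size, height))
            xs = range(bx, min(bx + size, width))
            # A block is kept whole if it contains any marked pixel.
            if any(mask[y][x] for y in ys for x in xs):
                for y in ys:
                    for x in xs:
                        blocked[y][x] = 1
    return blocked
```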

Decoding the bitstream for the scalable object-based video works in the
same manner as the previously described decoder except that it decodes an
object at a time instead of a frame at a time. When dropping objects, the
decoder merely reads the object header to find out how many bits long it is,
reads that many bits, and throws them away.
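
Dropping objects by reading only their headers can be sketched as follows. The
header layout used here (one byte object number, one byte object type, four
byte length in bits, big-endian) and the padding of each object to a whole
number of bytes are assumptions made for illustration; the text specifies only
that the header has fixed length and records number, type and length in bits.

```python
import struct

# Hypothetical fixed-length header: object number, object type, length in bits.
HEADER = struct.Struct(">BBI")


def pack_object(number, obj_type, payload):
    """Prefix a byte payload with the hypothetical fixed-length header."""
    return HEADER.pack(number, obj_type, len(payload) * 8) + payload


def drop_objects(stream, unwanted):
    """Copy the stream, skipping every object whose number is in `unwanted`.
    No payload is decoded; only the headers are read."""
    out = bytearray()
    pos = 0
    while pos < len(stream):
        number, _obj_type, length_bits = HEADER.unpack_from(stream, pos)
        size = HEADER.size + (length_bits + 7) // 8
        if number not in unwanted:
            out += stream[pos:pos + size]
        pos += size
    return bytes(out)
```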

Further, quality scalability can also be achieved by providing an
additional enhancement bitstream associated with each object. By decoding and
using the enhancement bitstream the quality of the selected objects can be
improved. If the channel bandwidth does not allow for the transmission of this
enhanced bitstream it can be dropped at the encoder. Alternately the decoder
may also optimize its performance by choosing to drop the enhancement
bitstreams associated with certain objects if the application does not need
them. The enhancement bitstream corresponding to a particular object is
generated at the encoder by computing the differences between the object in the
current frame and the final reconstructed object (after motion failure region
encoding) and again DCT (or wavelet) encoding these differences with a lower
quantization factor. Note that the reconstructed image should not be modified
with these differences for the bitstream to remain scalable, i.e., the encoder
and decoder remain in synchronization even if the enhancement bitstreams for
certain objects are dropped.

Figures 14a-b illustrate the preferred embodiment object removal: the
person on the left in Figure 14a has been removed in Figure 14b.

Error Concealment

The foregoing object-oriented methods compress a video sequence by
detecting moving objects (or difference regions which may include both object
and uncovered background) in each frame and separating them from the
stationary background. The shape, content and motion of these objects can
then be efficiently coded using motion compensation and the differences, if any,
using DCT or wavelets. When this compressed data is subjected to channel
errors, the decoder loses synchronization with the encoder, which manifests
itself in a catastrophic loss of picture quality. Therefore, to enable the decoder
to regain synchronization, the preferred embodiment resynchronization words
can be inserted into the bitstream. These resynchronization words are
introduced at the start of the data for an I frame and at the start of each of the
codes for the following items for every detected moving object in a P frame in
addition to the start of the P frame:

(i) the boundary contour data (bitmap or spline); 

(ii) the motion vector data; and 

(iii) the DCT data for the motion failure regions. 

Further, if control data or other data is also included, then this data can also
have resynchronization words. The resynchronization words are characterized
by the fact that they are unique; i.e., they are different from any given sequence
of coded bits of the same length because they are not in the Huffman code table
which is a static table. For example, if a P frame had three moving objects,
then the sequence would look like:

frame begin resynchronization word
contour resynchronization word
first object's contour data (e.g., bitmap or spline)
motion vector resynchronization word
first object's motion vectors (related to bitmap macroblocks)
DCT/wavelet resynchronization word
first object's motion failure data
contour resynchronization word
second object's contour data
motion vector resynchronization word
second object's motion vectors
DCT/wavelet resynchronization word
second object's motion failure data
contour resynchronization word
third object's contour data
motion vector resynchronization word
third object's motion vectors
DCT/wavelet resynchronization word
third object's motion failure data

These resynchronization words also help the decoder in detecting errors.
Once the decoder detects an error in the received bitstream, it tries to
find the nearest resynchronization word. Thus the decoder reestablishes
synchronization at the earliest possible time with a minimal loss of coded data.
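
The search for the nearest resynchronization word after a detected error can be
sketched as below. The bitstream is modeled as a string of '0' and '1'
characters for readability, and the function name and the bit patterns in the
usage example are hypothetical; the actual words are only required to be absent
from the Huffman code table.

```python
def next_resync(bits, error_pos, words):
    """Find the earliest resynchronization word at or after `error_pos`.

    `words` maps a kind (e.g. "contour") to its bit pattern. Returns
    (kind, index of the first bit after the word), or None if none is found.
    """
    best = None  # (start index, kind, word length)
    for kind, word in words.items():
        idx = bits.find(word, error_pos)
        if idx >= 0 and (best is None or idx < best[0]):
            best = (idx, kind, len(word))
    if best is None:
        return None
    return best[1], best[0] + best[2]
```

For example, with hypothetical words `{"contour": "11111110", "motion":
"11111101"}`, a decoder that detects an error inside the first object's contour
data would resume at the following motion vector word and conceal the lost
contour as described in the remainder of this section.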

An error may be detected at the decoder if any of the following conditions
is observed:

(i) an invalid codeword is found;

(ii) an invalid mode is detected while decoding;

(iii) the resynchronization word does not follow a decoded block of data;

(iv) a motion vector points outside of the frame;

(v) a decoded DCT value lies outside of permissible limits; or

(vi) the boundary contour is invalid (lies outside of the image).

If an error is detected in the boundary contour data, then the contour is
dropped and is made a part of the background; this means the corresponding
region of the previous frame is used. This reduces some distortion because
there often is a lot of temporal correlation in the video sequence.

If an error is detected in the motion vector data, then the average motion
vector for the object is applied to the entire object rather than each macroblock
using its own motion vector. This relies on the fact that there is large spatial
correlation in a given frame; therefore, most of the motion vectors of a given
object are approximately the same. Thus the average motion vector applied to
the various macroblocks of the object will be a good approximation and help
reduce visual distortion significantly.

If an error is detected in the motion failure region DCT data, then all of
the DCT coefficients are set to zero and the decoder attempts to resynchronize.

Error correction 

The error control code of the preferred embodiments comprises two 
Reed-Solomon (RS) coders with an interleaver in between as illustrated in 
Figure 15a. The bitstream to be transmitted is partitioned into groups of 
6 successive bits to form the symbols for the RS coders. This will apply 
generally to transmission over a channel with burst errors in addition to 
random errors. The interleaver mixes up the symbols from several codewords 
so that the symbols from any given codeword are well separated during 
transmission. When the codewords are reconstructed by the deinterleaver in 
the receiver, error bursts introduced by the channel are effectively broken up 
and spread across several codewords. The interleaver-deinterleaver pair thus 
transforms burst errors into effectively random errors. The delay multiplier 
m is chosen so that the overall delay is less than 250 msec. 
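
A delay-line interleaver of this kind can be sketched as follows. The Python below is a minimal illustration and not the embodiment's implementation: it assumes that symbol position i of each word is delayed by i*d words at the transmitter and by the complementary (n-1-i)*d words at the receiver, and all function names are hypothetical.

```python
from collections import deque


def make_delay_lines(delays, fill=None):
    """One FIFO per symbol position; a FIFO preloaded with k fill values delays by k words."""
    return [deque([fill] * k) for k in delays]


def push_word(lines, word):
    """Push one word through the per-position delay lines and return the output word."""
    out = []
    for line, sym in zip(lines, word):
        line.append(sym)
        out.append(line.popleft())
    return out


def interleave(words, n, d=1, fill=None):
    """Delay position i of every n-symbol word by i*d words."""
    lines = make_delay_lines([i * d for i in range(n)], fill)
    return [push_word(lines, w) for w in words]


def deinterleave(words, n, d=1, fill=None):
    """Apply the complementary delays (n-1-i)*d so each codeword is reassembled."""
    lines = make_delay_lines([(n - 1 - i) * d for i in range(n)], fill)
    return [push_word(lines, w) for w in words]
```

With these delays a burst that wipes out one whole transmitted word touches n different codewords, one symbol each, after deinterleaving.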

Each of the RS coders uses an RS code over the Galois field GF(64) and 
maps a block of 6-bit information symbols into a larger block of 6-bit codeword 
symbols. The first RS coder codes an input block of k 6-bit information symbols 
as n2 6-bit symbols and feeds these to the interleaver, and the second RS coder 
takes the output of the interleaver and maps the n2 6-bit symbols into n1 6-bit 
codeword symbols; n1 - n2 = 4. 

At the receiver, each block of n1 6-bit symbols is fed to a decoder for the 
second coder. This RS decoder, though capable of correcting up to 2 6-bit 
symbol errors, is set to correct single errors only. When it detects any higher 
number of errors, it outputs n2 erased symbols. The deinterleaver spreads 
these erasures over n2 codewords which are then input to the decoder for the 
first RS coder. This decoder can correct any combination of E errors and S 
erasures such that 2E+S <= n2-k. If 2E+S is greater than the above number, 
then the data is output as is and the erasures in the data, if any, are noted by 
the decoder. 
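
The correction limits quoted above follow from the parity counts. The short Python sketch below restates them; the function names are hypothetical, and the only facts used are the standard RS property that n-k parity symbols correct up to (n-k)/2 symbol errors and the condition 2E+S <= n2-k given in the text.

```python
def max_correctable_errors(n, k):
    """An RS code with n - k parity symbols corrects up to (n - k) // 2 symbol errors."""
    return (n - k) // 2


def first_code_can_decode(errors, erasures, n2, k):
    """Condition from the text for the decoder of the first RS coder: 2E + S <= n2 - k."""
    return 2 * errors + erasures <= n2 - k


# Example with (k, n2, n1) = (24, 28, 32): the second code has n1 - n2 = 4 parity symbols.
print(max_correctable_errors(32, 28))        # symbol errors correctable by the second code
print(first_code_can_decode(1, 2, 28, 24))   # 2*1 + 2 = 4 <= 4
print(first_code_can_decode(2, 1, 28, 24))   # 2*2 + 1 = 5 > 4
```

An erasure costs one parity symbol while an unlocated error costs two, which is why erasing a doubtful block in the receiver's first stage can pay off in the second.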

The performance of the preferred embodiment error correction exceeds 
the simple correction so far described by further adding a feedback from the 
second decoder (after the deinterleaver) to the first decoder, thereby 
improving the error correction of the first decoder. In particular, assume that the 
first decoder can correct E errors and can detect (and erase) T errors. Also presume 
the second decoder can correct S erasures in any given block of N2 symbols. 
Further, assume that at time t the first decoder detects X errors in the input 
block B, which consists of 6-bit symbols, with X > E; this implies a decoding 
failure at time t. This decoding failure results in the first decoder outputting 
N2 erased symbols. The preferred embodiment error correction system as 
illustrated in Figure 15b includes a buffer to store the input block B of N 
symbols and the time t at which the decoding failure occurred; this will be used 
in the feedback described below. The deinterleaver takes the N2 erased symbol 
block output of the first decoder and spreads out the erased symbols over the 
next N2 blocks: one erased symbol per block. Thus the erased symbols from 
block B appear at the second decoder at times t, t+d, t+2d, ..., t+(N2-1)d where 
d is the delay increment of the deinterleaver and relates to the block length. 

Consider the time t. If the number of erased symbols in the input block 
to the second decoder at time t is less than or equal to S, then the second 
decoder can correct all the erasures in this input block. One of the corrected 
erasures derives from the input block B to the first decoder at time t. This 
corrected erasure can be either (1) one of the symbols of the input block B 
which was an error detected by the first decoder or (2) a symbol which was not 
in error in block B but was erased due to the decoding failure. 

Compare the corrected erasure with the contents of the corresponding 
location in block B which has been stored in the buffer. If the corrected erasure 
is the same as the corresponding contents of the stored block, then the corrected 
erased symbol was of category (2) and this output of the second decoder is used 
without any modification. However, if the corrected erased symbol does not 
match the contents of the corresponding location in block B, then this 
corresponding location symbol was one of the error symbols in block B. Thus 
this error has been corrected by the second decoder and this correction may be 
made in block B as stored in the buffer; that is, an originally uncorrectable 
error in block B for the first decoder has been corrected in the stored copy of 
block B by a feedback from the second decoder. This reduces the number of 
errors X that would be detected by the first decoder if the thus corrected block 
B were again input to the first decoder. Repeat this erasure correcting by the 
second decoder at later times t+id (i = 1, ..., N2-1) which correspond to the 
erasures derived from B; this may reduce the number of errors detectable in 
block B to X-Y. Once X-Y is less than E, all of the remaining errors in the now 
corrected input block B can be corrected, and the deinterleaver may be updated 
with the thus corrected input block B. This reduces the number of erased 
symbols being passed to the second decoder at subsequent times, thereby 
increasing the overall probability of error correction. Contrarily, if it is not 
possible to correct all of the errors in the input block B, then the corrections 
made by the second decoder are used without modification. Note that if an 
extension of the overall delay were tolerable, then the corrected block B could 
be reinput to the first decoder. 
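
The compare-and-patch step of the feedback can be sketched as below. This Python fragment is only an illustration under stated assumptions: the buffered block B is modeled as a list of symbol values, each erasure corrected by the second decoder arrives as a hypothetical (position, symbol) pair, and the function names are not from the source.

```python
def apply_feedback(stored_block, position, corrected_symbol):
    """Compare one corrected erasure with the buffered copy of block B.

    Returns False for category (2): the buffered symbol was already right.
    Returns True for category (1): the buffered symbol was in error and is patched.
    """
    if stored_block[position] == corrected_symbol:
        return False
    stored_block[position] = corrected_symbol
    return True


def feedback_pass(stored_block, corrections):
    """Apply the corrections arriving at times t, t+d, t+2d, ... and return Y,
    the number of errors in the buffered block B that were repaired."""
    return sum(apply_feedback(stored_block, pos, sym) for pos, sym in corrections)
```

After such a pass the buffered block holds X-Y errors, and the text's criterion on X-Y decides whether the first decoder could now correct the rest.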

Simulations show that the foregoing channel coding is capable of 
correcting all burst lengths of duration less than 24 msec at transmission rates 
of 24 Kbps and 48 Kbps. 

In the case of random errors of probability 0.001 for choices of (k,n2,n1) 
equal to (24,28,32), (26,30,34), (27,31,34), and (28,32,36) the decoded bit error 
rate was less than 0.00000125, 0.000007, and 0.0000285, respectively with 
multiplier m=1. Similarly, for m=2 (38,43,48) may be used. Note that the 
overall delay depends upon the codeword size due to the interleaver delays. In 
fact, the overall delay is 

delay = 6(m n2)^2 / bitrate 

where the 6 comes from the use of 6-bit symbols, and the second power arises 
because the number of symbols in the codewords determines both the number of 
delays and the increment between delays. Of course, the number of parity 
symbols (n1-n2 and n2-k) used depends upon the bit error rate performance 
desired and the overall delay. 
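
The delay expression can be evaluated directly. The Python sketch below reads the formula as delay = 6(m n2)^2 / bitrate, which is the reading suggested by the remarks on the factor 6 and the second power, and it assumes 1 Kbps = 1000 bits per second; the function name is hypothetical.

```python
def overall_delay_seconds(m, n2, bitrate_bps, bits_per_symbol=6):
    """Overall interleaver delay in seconds: bits_per_symbol * (m * n2)**2 / bitrate."""
    return (m * n2) ** 2 * bits_per_symbol / bitrate_bps


# Tabulate the delay for the m = 1 codes listed in the text at both bit rates.
for k, n2, n1 in [(24, 28, 32), (26, 30, 34), (27, 31, 34), (28, 32, 36)]:
    for rate in (24000, 48000):
        ms = 1000 * overall_delay_seconds(1, n2, rate)
        print(f"(k,n2,n1)=({k},{n2},{n1}) at {rate} bps: {ms:.1f} msec")
```

Because the delay grows with the square of n2, a small increase in codeword length has a large effect on latency, which is the trade-off against parity noted above.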

In our simulations with a bitstream of 3604480 6-bit symbols, at a 
probability of error of 1e-3, the number of erasures without feedback is 
46/3604480 6-bit symbols (1.28e-5). With feedback, the number of erasures is 
24/3604480 6-bit symbols (6.66e-6). For the combination of burst errors and 
random errors, the number of erasures without feedback is 135/3604480 (3.75e-5) 
and with feedback the number of erasures is 118/3604480 6-bit symbols 
(3.27e-5). 
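
The rates quoted in parentheses are each erasure count divided by the 3604480-symbol stream length. The small Python check below reproduces that arithmetic; the names are hypothetical and the counts are those reported above.

```python
TOTAL_SYMBOLS = 3604480  # 6-bit symbols in the simulated bitstream


def erasure_rate(erasures, total=TOTAL_SYMBOLS):
    """Fraction of symbols left erased after decoding."""
    return erasures / total


for label, count in [
    ("random errors, no feedback", 46),
    ("random errors, feedback", 24),
    ("burst + random, no feedback", 135),
    ("burst + random, feedback", 118),
]:
    print(f"{label}: {erasure_rate(count):.2e}")
```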

Figures 16a-b are heuristic examples illustrating the feedback error 
correction. In particular, the first row in Figure 16a shows a sequence of 
symbols A1,B1,A2,B2,... which would be the information bitstream to be 
transmitted; each symbol would be a group of successive bits (e.g., 6 bits). For 
simplicity of illustration, the first coder is presumed to encode two information 
symbols as a three symbol codeword; i.e., A1,B1 encodes as A1,B1,P1 with P1 
being a parity symbol. This is analogous to the 26 information symbols encoded 
as 30 symbols with 4 parity symbols as in one of the foregoing preferred 
embodiments. The second row of Figure 16a shows the codewords. The 
interleaver spreads out the symbols by delays as shown in the second and third 
rows of Figure 16a. In detail, the Aj symbols have no delays, the Bj symbols 
have delays of 3 symbols, and the Pj symbols have delays of 6 symbols. The 
slanting arrows in Figure 16a indicate the delays. 

The interleaver output (sequence of 3-symbol words) is encoded by the 
second encoding as 4-symbol codewords. The fourth row of Figure 16a 
illustrates the second encoding of the 3-symbol words of the third row by 
adding a parity symbol Qj to form a 4-symbol codeword. 

Row five of Figure 16a indicates three exemplary transmission errors by 
way of the X's over the symbols A3, P1, and B3. Presume for simplicity that the 
decoders can correct one error per codeword or can detect two errors and erase 
the codeword symbols. Row 6 of Figure 16a shows the decoding to correct 
the error in symbol B3 and erase the A3, B2, P1 word as indicated by O's over 
the symbols. 

The deinterleaver reassembles the 3-symbol codewords by delays which 
are complementary to the interleaver delays: the Aj symbols have delays of 6 
symbols, the Bj symbols have delays of 3 symbols, and the Pj symbols have no 
delays. Rows 6-7 show the delays with slanting arrows. Note the erased symbols 
spread out in the deinterleaving. 

Figure 16a row 8 illustrates the second decoder correcting the erased 
symbols to recover the A1,B1,A2,B2 information. 

Figure 16b illustrates the same arrangement as Figure 16a but with an 
additional error which can only be corrected by use of the preferred 
embodiment feedback to the deinterleaver. In particular, row 5 of Figure 16b 
shows 6 errors depicted as X's over the symbols A2, B1, A3, P1, B3, and A4. 
In this case the first decoder detects two errors in each of the corresponding 
codewords and erases all three codewords as illustrated by O's over the symbols 
in row 6 of Figure 16b. 

The deinterleaver again reassembles the 3-symbol codewords by delays 
which are complementary to the interleaver delays; rows 6-7 of Figure 16b show 
the delays with slanting arrows. The erased symbols again spread out, but 
three erasures in codeword A2, B2, P2 cannot be corrected. However, the 
codeword A1, B1, P1 with B1 and P1 erased can be corrected by the second 
decoder to give the true codeword A1, B1, P1. Then the true B1 can be 
compared to the word A2, B1, P0, Q2 in row 5, and the fact that B1 differs in 
this word implies that B1 was one of the two errors in this word. Thus the true 
B1 can be used to form a word with only one remaining error (A2), and this word 
can be error corrected to give the true A2, B1, P0. This is the feedback: a later 
error correction (B1 in this example) is used to make an error correction in a 
previously uncorrected word (which has already been decoded), and then this 
correction of the past also provides a correction of a symbol (A2 in this example) 
for future use: the erased A2 being delayed in the deinterleaver can be corrected 
to the true A2 and reduce the number of erasures in the codeword A2, B2, P2 to 
two. Thus the codeword A2, B2, P2 can now be corrected. Thus the feedback from 
the A1, B1, P1 correction to the A2, B1, P0, Q2 decoding led to the correction 
of A2 and then to the possible correction of the codeword A2, B2, P2. Of course, 
the numbers of symbols used and correctable in these examples are heuristic 
and only for simple illustration. 
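
The bookkeeping in Figures 16a-b can be reproduced with a few lines of Python. The sketch models only which codeword each symbol of a transmitted word belongs to, using the delays stated for the example (Aj none, Bj one word, Pj two words); it does not implement any actual coding, and the function names are hypothetical.

```python
def word_contents(t):
    """Labels of the three symbols carried by the transmitted word at time t (Q parity omitted)."""
    return [f"A{t}", f"B{t - 1}", f"P{t - 2}"]


def erasures_per_codeword(erased_words):
    """For each codeword index j, count how many of Aj, Bj, Pj lie in erased words."""
    counts = {}
    for t in erased_words:
        for offset in range(3):  # word t carries symbols of codewords t, t-1, t-2
            j = t - offset
            counts[j] = counts.get(j, 0) + 1
    return counts


print(word_contents(3))                  # the word erased in Figure 16a
print(erasures_per_codeword([3]))        # one erasure in each of codewords 3, 2, 1
print(erasures_per_codeword([2, 3, 4]))  # Figure 16b: codeword 2 collects three erasures
```

In the Figure 16b case the count for codeword 1 is two (B1 and P1) and for codeword 2 is three, which is the situation the feedback resolves by recovering A2.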

Appendix 

A listing of machine instructions written in the C language for an 
implementation of the foregoing preferred embodiments appears in the attached 
Appendix. 

Modifications 

The preferred embodiments may be varied in many ways while retaining 
one or more of their features. For example, the size of blocks, codes, thresholds, 
morphology neighborhoods, quantization levels, symbols, and so forth can be 
changed. Methods such as particular splines, quantization methods, transform 
methods, and so forth can be varied. 